-
Notifications
You must be signed in to change notification settings - Fork 13.5k
[Clang] Add warnings when mixing different charN_t types #138708
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
charN_t represent code units of different UTF encodings. Therefore the values of 2 different charN_t objects do not represent the same characters. In order to avoid comparing apples and oranges, we add new warnings to warn on: - Implicit conversions - Comparisons - Other cases involving arithmetic conversions We only produce the warning if we cannot establish the comparison would be safe through constant evaluation. The new `-Wimplicit-unicode-conversion` warning is enabled by default. Note that this PR intentionally doesn;t touches char/wchar_t, but it would be worth considering also warning on extending the new warnings to these types (in a follow up) Additionally most arithmetic operations on charN_t don't really make sense (ie what does it mean to addition code units), so we could add warnings for that. Fixes llvm#138526
@llvm/pr-subscribers-libcxx @llvm/pr-subscribers-clang Author: cor3ntin (cor3ntin) ChangescharN_t represent code units of different UTF encodings. Therefore the values of 2 different charN_t objects do not represent the same characters. In order to avoid comparing apples and oranges, we add new warnings to warn on:
We only produce the warning if we cannot establish the comparison would be safe through constant evaluation. The new Note that this PR intentionally doesn;t touches char/wchar_t, but it would be worth considering also warning on extending the new warnings to these types (in a follow up) Additionally most arithmetic operations on charN_t don't really make sense (ie what does it mean to addition code units), so we could add warnings for that. Fixes #138526 Patch is 24.76 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/138708.diff 12 Files Affected:
diff --git a/clang/docs/ReleaseNotes.rst b/clang/docs/ReleaseNotes.rst
index 203958dab7430..3a42f43d79fd1 100644
--- a/clang/docs/ReleaseNotes.rst
+++ b/clang/docs/ReleaseNotes.rst
@@ -503,6 +503,10 @@ Improvements to Clang's diagnostics
- ``-Wreserved-identifier`` now fires on reserved parameter names in a function
declaration which is not a definition.
+- A new ``-Wimplicit-unicode-conversion`` warns where comparing or implicitly converting
+ between different Unicode character types (``char8_t``, ``char16_t``, ``char32_t``).
+ This warning only triggers in C++ as these types are aliases in C. (#GH138526)
+
Improvements to Clang's time-trace
----------------------------------
diff --git a/clang/include/clang/AST/ASTDiagnostic.h b/clang/include/clang/AST/ASTDiagnostic.h
index ef22249828629..baa410e3e4a03 100644
--- a/clang/include/clang/AST/ASTDiagnostic.h
+++ b/clang/include/clang/AST/ASTDiagnostic.h
@@ -38,6 +38,9 @@ namespace clang {
/// is initialized before passing it in.
QualType desugarForDiagnostic(ASTContext &Context, QualType QT,
bool &ShouldAKA);
+
+ std::string FormatUTFCodeUnitAsCodepoint(unsigned Value, QualType T);
+
} // end namespace clang
#endif
diff --git a/clang/include/clang/AST/Type.h b/clang/include/clang/AST/Type.h
index 02a6fb5333538..7fca11fb708cf 100644
--- a/clang/include/clang/AST/Type.h
+++ b/clang/include/clang/AST/Type.h
@@ -2524,6 +2524,7 @@ class alignas(TypeAlignment) Type : public ExtQualsTypeCommonBase {
bool isChar16Type() const;
bool isChar32Type() const;
bool isAnyCharacterType() const;
+ bool isUnicodeCharacterType() const;
bool isIntegralType(const ASTContext &Ctx) const;
/// Determine whether this type is an integral or enumeration type.
diff --git a/clang/include/clang/Basic/DiagnosticGroups.td b/clang/include/clang/Basic/DiagnosticGroups.td
index 1faf8508121f4..e5b5dbbd07f10 100644
--- a/clang/include/clang/Basic/DiagnosticGroups.td
+++ b/clang/include/clang/Basic/DiagnosticGroups.td
@@ -111,6 +111,7 @@ def EnumConversion : DiagGroup<"enum-conversion",
ImplicitEnumEnumCast,
EnumFloatConversion,
EnumCompareConditional]>;
+def ImplicitUnicodeConversion : DiagGroup<"implicit-unicode-conversion">;
def DeprecatedOFast : DiagGroup<"deprecated-ofast">;
def ObjCSignedCharBoolImplicitIntConversion :
DiagGroup<"objc-signed-char-bool-implicit-int-conversion">;
diff --git a/clang/include/clang/Basic/DiagnosticSemaKinds.td b/clang/include/clang/Basic/DiagnosticSemaKinds.td
index e5a7cdc14a737..a018f6693cff2 100644
--- a/clang/include/clang/Basic/DiagnosticSemaKinds.td
+++ b/clang/include/clang/Basic/DiagnosticSemaKinds.td
@@ -4357,6 +4357,26 @@ def warn_address_of_reference_bool_conversion : Warning<
"code; pointer may be assumed to always convert to true">,
InGroup<UndefinedBoolConversion>;
+def warn_impcast_unicode_char_type : Warning<
+ "implicit conversion from %0 to %1 may change the meaning of the represented code unit">,
+ InGroup<ImplicitUnicodeConversion>;
+def warn_impcast_unicode_precision : Warning<
+ "implicit conversion from %0 to %1 may lose precision and change the meaning of the represented code unit">,
+ InGroup<ImplicitUnicodeConversion>;
+def warn_impcast_unicode_char_type_constant
+ : Warning<"implicit conversion from %0 to %1 changes the meaning of the "
+ "%select{code unit|codepoint}2 '%3'">,
+ InGroup<ImplicitUnicodeConversion>;
+
+def warn_comparison_unicode_mixed_types : Warning<
+ "comparing values of different Unicode code unit types %0 and %1 may compare different codepoints">,
+ InGroup<ImplicitUnicodeConversion>;
+
+def warn_comparison_unicode_mixed_types_constant
+ : Warning<"comparing values of different Unicode code unit types %0 and %1 "
+ "compares unrelated code units '%2' and '%3'">,
+ InGroup<ImplicitUnicodeConversion>;
+
def warn_xor_used_as_pow : Warning<
"result of '%0' is %1; did you mean exponentiation?">,
InGroup<XorUsedAsPow>;
@@ -7719,6 +7739,11 @@ def warn_comparison_of_mixed_enum_types_switch : Warning<
"%diff{ ($ and $)|}0,1">,
InGroup<EnumCompareSwitch>;
+def warn_arith_conv_mixed__unicode_types
+ : Warning<"%sub{select_arith_conv_kind}0 "
+ "different Unicode character types %1 and %2">,
+ InGroup<ImplicitUnicodeConversion>;
+
def err_typecheck_assign_const : Error<
"%select{"
"cannot assign to return value because function %1 returns a const value|"
diff --git a/clang/lib/AST/ASTDiagnostic.cpp b/clang/lib/AST/ASTDiagnostic.cpp
index 6cb09b0492ac9..0c9f50fb1a01c 100644
--- a/clang/lib/AST/ASTDiagnostic.cpp
+++ b/clang/lib/AST/ASTDiagnostic.cpp
@@ -20,6 +20,8 @@
#include "clang/AST/TemplateBase.h"
#include "clang/AST/Type.h"
#include "llvm/ADT/StringExtras.h"
+#include "llvm/Support/ConvertUTF.h"
+#include "llvm/Support/Format.h"
#include "llvm/Support/raw_ostream.h"
using namespace clang;
@@ -2190,3 +2192,30 @@ static bool FormatTemplateTypeDiff(ASTContext &Context, QualType FromType,
TD.DiffTemplate();
return TD.Emit();
}
+
+std::string clang::FormatUTFCodeUnitAsCodepoint(unsigned Value, QualType T) {
+ auto IsSingleCodeUnitCP = [](unsigned Value, QualType T) {
+ if (T->isChar8Type()) {
+ assert(Value <= 0xFF && "not a valid UTF-8 code unit");
+ return Value <= 0x7F;
+ }
+ if (T->isChar16Type()) {
+ assert(Value <= 0xFFFF && "not a valid UTF-16 code unit");
+ return llvm::IsSingleCodeUnitUTF16Codepoint(Value);
+ }
+ return llvm::IsSingleCodeUnitUTF32Codepoint(Value);
+ };
+ llvm::SmallVector<char, 4> Str;
+ if (!IsSingleCodeUnitCP(Value, T)) {
+ llvm::raw_svector_ostream OS(Str);
+ OS << "<" << llvm::format_hex(Value, 1, /*Upper=*/true) << ">";
+ return std::string(Str.begin(), Str.end());
+ }
+
+ char Buffer[UNI_MAX_UTF8_BYTES_PER_CODE_POINT];
+ char *Ptr = Buffer;
+ [[maybe_unused]] bool Converted = llvm::ConvertCodePointToUTF8(Value, Ptr);
+ assert(Converted && "trying to encode invalid code unit");
+ EscapeStringForDiagnostic(StringRef(Buffer, Ptr - Buffer), Str);
+ return std::string(Str.begin(), Str.end());
+}
diff --git a/clang/lib/AST/Type.cpp b/clang/lib/AST/Type.cpp
index fbd09141bc541..2da63b13faf9d 100644
--- a/clang/lib/AST/Type.cpp
+++ b/clang/lib/AST/Type.cpp
@@ -2193,6 +2193,20 @@ bool Type::isAnyCharacterType() const {
}
}
+bool Type::isUnicodeCharacterType() const {
+ const auto *BT = dyn_cast<BuiltinType>(CanonicalType);
+ if (!BT)
+ return false;
+ switch (BT->getKind()) {
+ default:
+ return false;
+ case BuiltinType::Char8:
+ case BuiltinType::Char16:
+ case BuiltinType::Char32:
+ return true;
+ }
+}
+
/// isSignedIntegerType - Return true if this is an integer type that is
/// signed, according to C99 6.2.5p4 [char, signed char, short, int, long..],
/// an enum decl which has a signed representation
diff --git a/clang/lib/Sema/SemaChecking.cpp b/clang/lib/Sema/SemaChecking.cpp
index 97f623f61a405..d12b5cea37aa6 100644
--- a/clang/lib/Sema/SemaChecking.cpp
+++ b/clang/lib/Sema/SemaChecking.cpp
@@ -14,6 +14,7 @@
#include "CheckExprLifetime.h"
#include "clang/AST/APValue.h"
#include "clang/AST/ASTContext.h"
+#include "clang/AST/ASTDiagnostic.h"
#include "clang/AST/Attr.h"
#include "clang/AST/AttrIterator.h"
#include "clang/AST/CharUnits.h"
@@ -11810,6 +11811,46 @@ static void DiagnoseIntInBoolContext(Sema &S, Expr *E) {
}
}
+static void DiagnoseMixedUnicodeImplicitConversion(Sema &S, const Type *Source,
+ const Type *Target, Expr *E,
+ QualType T,
+ SourceLocation CC) {
+ assert(Source->isUnicodeCharacterType() && Target->isUnicodeCharacterType() &&
+ Source != Target);
+ Expr::EvalResult Result;
+ if (E->EvaluateAsInt(Result, S.getASTContext(), Expr::SE_AllowSideEffects,
+ S.isConstantEvaluatedContext())) {
+ llvm::APSInt Value(32);
+ Value = Result.Val.getInt();
+ bool IsASCII = Value <= 0x7F;
+ bool IsBMP = Value <= 0xD7FF || (Value >= 0xE000 && Value <= 0xFFFF);
+ bool ConversionPreservesSemantics =
+ IsASCII || (!Source->isChar8Type() && !Target->isChar8Type() && IsBMP);
+
+ if (!ConversionPreservesSemantics) {
+ auto IsSingleCodeUnitCP = [](const QualType &T,
+ const llvm::APSInt &Value) {
+ if (T->isChar8Type())
+ return llvm::IsSingleCodeUnitUTF8Codepoint(Value.getExtValue());
+ if (T->isChar16Type())
+ return llvm::IsSingleCodeUnitUTF16Codepoint(Value.getExtValue());
+ return llvm::IsSingleCodeUnitUTF32Codepoint(Value.getExtValue());
+ };
+
+ S.Diag(CC, diag::warn_impcast_unicode_char_type_constant)
+ << E->getType() << T
+ << IsSingleCodeUnitCP(E->getType().getUnqualifiedType(), Value)
+ << FormatUTFCodeUnitAsCodepoint(Value.getExtValue(), E->getType());
+ }
+ } else {
+ bool LosesPrecision = S.getASTContext().getIntWidth(E->getType()) >
+ S.getASTContext().getIntWidth(T);
+ DiagnoseImpCast(S, E, T, CC,
+ LosesPrecision ? diag::warn_impcast_unicode_precision
+ : diag::warn_impcast_unicode_char_type);
+ }
+}
+
void Sema::CheckImplicitConversion(Expr *E, QualType T, SourceLocation CC,
bool *ICContext, bool IsListInit) {
if (E->isTypeDependent() || E->isValueDependent()) return;
@@ -12147,6 +12188,13 @@ void Sema::CheckImplicitConversion(Expr *E, QualType T, SourceLocation CC,
DiscardMisalignedMemberAddress(Target, E);
+
+ if(Source->isUnicodeCharacterType() && Target->isUnicodeCharacterType()) {
+ DiagnoseMixedUnicodeImplicitConversion(*this, Source, Target, E, T, CC);
+ return;
+ }
+
+
if (Target->isBooleanType())
DiagnoseIntInBoolContext(*this, E);
diff --git a/clang/lib/Sema/SemaExpr.cpp b/clang/lib/Sema/SemaExpr.cpp
index be3f145f3c5f1..b0080b778db61 100644
--- a/clang/lib/Sema/SemaExpr.cpp
+++ b/clang/lib/Sema/SemaExpr.cpp
@@ -15,6 +15,7 @@
#include "UsedDeclVisitor.h"
#include "clang/AST/ASTConsumer.h"
#include "clang/AST/ASTContext.h"
+#include "clang/AST/ASTDiagnostic.h"
#include "clang/AST/ASTLambda.h"
#include "clang/AST/ASTMutationListener.h"
#include "clang/AST/CXXInheritance.h"
@@ -1567,6 +1568,72 @@ void Sema::checkEnumArithmeticConversions(Expr *LHS, Expr *RHS,
}
}
+static void CheckUnicodeArithmeticConversions(Sema & SemaRef,
+ Expr *LHS,
+ Expr *RHS,
+ SourceLocation Loc,
+ ArithConvKind ACK) {
+ QualType LHSType = LHS->getType().getUnqualifiedType();
+ QualType RHSType = RHS->getType().getUnqualifiedType();
+
+ if(!SemaRef.getLangOpts().CPlusPlus ||
+ !LHSType->isUnicodeCharacterType() || !RHSType->isUnicodeCharacterType())
+ return;
+
+ if(ACK == ArithConvKind::Comparison) {
+ if (SemaRef.getASTContext().hasSameType(LHSType, RHSType))
+ return;
+
+ Expr::EvalResult LHSRes, RHSRes;
+ bool Success = LHS->EvaluateAsInt(LHSRes, SemaRef.getASTContext(),
+ Expr::SE_AllowSideEffects,
+ SemaRef.isConstantEvaluatedContext());
+ if (Success)
+ Success = RHS->EvaluateAsInt(RHSRes, SemaRef.getASTContext(),
+ Expr::SE_AllowSideEffects,
+ SemaRef.isConstantEvaluatedContext());
+ if (Success) {
+ llvm::APSInt LHSValue(32);
+ LHSValue = LHSRes.Val.getInt();
+ llvm::APSInt RHSValue(32);
+ RHSValue = RHSRes.Val.getInt();
+
+ auto IsSingleCodeUnitCP = [](const QualType &T,
+ const llvm::APSInt &Value) {
+ if (T->isChar8Type())
+ return llvm::IsSingleCodeUnitUTF8Codepoint(Value.getExtValue());
+ if (T->isChar16Type())
+ return llvm::IsSingleCodeUnitUTF16Codepoint(Value.getExtValue());
+ return llvm::IsSingleCodeUnitUTF32Codepoint(Value.getExtValue());
+ };
+
+ bool LHSSafe = IsSingleCodeUnitCP(LHSType, LHSValue);
+ bool RHSSafe = IsSingleCodeUnitCP(RHSType, RHSValue);
+ if (LHSSafe && RHSSafe)
+ return;
+
+ SemaRef.Diag(Loc, diag::warn_comparison_unicode_mixed_types_constant)
+ << LHS->getSourceRange() << RHS->getSourceRange() << LHSType
+ << RHSType
+ << FormatUTFCodeUnitAsCodepoint(LHSValue.getExtValue(), LHSType)
+ << FormatUTFCodeUnitAsCodepoint(RHSValue.getExtValue(), RHSType);
+ return;
+ }
+ SemaRef.Diag(Loc, diag::warn_comparison_unicode_mixed_types)
+ << LHS->getSourceRange() << RHS->getSourceRange()
+ << LHSType << RHSType;
+ return;
+ }
+
+ if (SemaRef.getASTContext().hasSameType(LHSType, RHSType))
+ return;
+
+ SemaRef.Diag(Loc, diag::warn_arith_conv_mixed__unicode_types)
+ << LHS->getSourceRange() << RHS->getSourceRange() << ACK << LHSType
+ << RHSType;
+ return;
+}
+
/// UsualArithmeticConversions - Performs various conversions that are common to
/// binary operators (C99 6.3.1.8). If both operands aren't arithmetic, this
/// routine returns the first non-arithmetic type found. The client is
@@ -1574,8 +1641,12 @@ void Sema::checkEnumArithmeticConversions(Expr *LHS, Expr *RHS,
QualType Sema::UsualArithmeticConversions(ExprResult &LHS, ExprResult &RHS,
SourceLocation Loc,
ArithConvKind ACK) {
+
checkEnumArithmeticConversions(LHS.get(), RHS.get(), Loc, ACK);
+ CheckUnicodeArithmeticConversions(*this, LHS.get(), RHS.get(),
+ Loc, ACK);
+
if (ACK != ArithConvKind::CompAssign) {
LHS = UsualUnaryConversions(LHS.get());
if (LHS.isInvalid())
diff --git a/clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp b/clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp
new file mode 100644
index 0000000000000..41794b15175b5
--- /dev/null
+++ b/clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp
@@ -0,0 +1,155 @@
+// RUN: %clang_cc1 -verify -fsyntax-only -std=c++20 -Wconversion %s
+
+void c8(char8_t);
+void c16(char16_t);
+void c32(char32_t);
+
+void test(char8_t u8, char16_t u16, char32_t u32) {
+ c8(u8);
+ c8(u16); // expected-warning {{implicit conversion from 'char16_t' to 'char8_t' may lose precision and change the meaning of the represented code unit}}
+ c8(u32); // expected-warning {{implicit conversion from 'char32_t' to 'char8_t' may lose precision and change the meaning of the represented code unit}}
+
+ c16(u8); // expected-warning {{implicit conversion from 'char8_t' to 'char16_t' may change the meaning of the represented code unit}}
+ c16(u16);
+ c16(u32); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' may lose precision and change the meaning of the represented code unit}}
+
+ c32(u8); // expected-warning {{implicit conversion from 'char8_t' to 'char32_t' may change the meaning of the represented code unit}}
+ c32(u16); // expected-warning {{implicit conversion from 'char16_t' to 'char32_t' may change the meaning of the represented code unit}}
+ c32(u32);
+
+
+ c8(char32_t(0x7f));
+ c8(char32_t(0x80)); // expected-warning {{implicit conversion from 'char32_t' to 'char8_t' changes the meaning of the codepoint '<U+0080>'}}
+
+ c8(char16_t(0x7f));
+ c8(char16_t(0x80)); // expected-warning {{implicit conversion from 'char16_t' to 'char8_t' changes the meaning of the codepoint '<U+0080>'}}
+ c8(char16_t(0xD800)); // expected-warning {{implicit conversion from 'char16_t' to 'char8_t' changes the meaning of the code unit '<0xD800>'}}
+ c8(char16_t(0xE000)); // expected-warning {{implicit conversion from 'char16_t' to 'char8_t' changes the meaning of the codepoint '<U+E000>'}}
+
+
+ c16(char32_t(0x7f));
+ c16(char32_t(0x80));
+ c16(char32_t(0xD7FF));
+ c16(char32_t(0xD800)); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' changes the meaning of the code unit '<0xD800>'}}
+ c16(char32_t(0xE000));
+ c16(char32_t(U'🐉')); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' changes the meaning of the codepoint '🐉'}}
+
+
+ c32(char8_t(0x7f));
+ c32(char8_t(0x80)); // expected-warning {{implicit conversion from 'char8_t' to 'char32_t' changes the meaning of the code unit '<0x80>'}}
+ c32(char8_t(0xFF)); // expected-warning {{implicit conversion from 'char8_t' to 'char32_t' changes the meaning of the code unit '<0xFF>'}}
+
+
+ c32(char16_t(0x7f));
+ c32(char16_t(0x80));
+
+ c32(char16_t(0xD7FF));
+ c32(char16_t(0xD800)); // expected-warning {{implicit conversion from 'char16_t' to 'char32_t' changes the meaning of the code unit '<0xD800>'}}
+ c32(char16_t(0xDFFF)); // expected-warning {{implicit conversion from 'char16_t' to 'char32_t' changes the meaning of the code unit '<0xDFFF>'}}
+ c32(char16_t(0xE000));
+ c32(char16_t(u'☕'));
+
+ (void)static_cast<char32_t>(char8_t(0x80)); // sanity check: no explicit conversion;
+
+ using Char8 = char8_t;
+ Char8 c81 = u16; // expected-warning {{implicit conversion from 'char16_t' to 'Char8' (aka 'char8_t') may lose precision and change the meaning of the represented code unit}}
+
+ [[maybe_unused]] char c = u16; // expected-warning {{implicit conversion loses integer precision: 'char16_t' to 'char'}}
+
+ // FIXME: We should apply the same logic to wchar
+ [[maybe_unused]] wchar_t wc = u16;
+ [[maybe_unused]] wchar_t wc2 = u8;
+}
+
+void test_comp(char8_t u8, char16_t u16, char32_t u32) {
+ (void)(u8 == u8' ');
+ (void)(u8 == u' '); // expected-warning{{comparing values of different Unicode code unit types 'char8_t' and 'char16_t' may compare different codepoints}}
+ (void)(u8 == U' '); // expected-warning{{comparing values of different Unicode code unit types 'char8_t' and 'char32_t' may compare different codepoints}}
+
+ (void)(u16 == u8' '); // expected-warning{{comparing values of different Unicode code unit types 'char16_t' and 'char8_t' may compare different codepoints}}
+ (void)(u16 == u' ');
+ (void)(u16 == U' '); // expected-warning{{comparing values of different Unicode code unit types 'char16_t' and 'char32_t' may compare different codepoints}}
+
+ (void)(u32 == u8' '); // expected-warning{{comparing values of different Unicode code unit types 'char32_t' and 'char8_t' may compare different codepoints}}
+ (void)(u32 == u' '); // expected-warning{{comparing values of different Unicode code unit types 'char32_t' and 'char16_t' may compare different codepoints}}
+ (void)(u32 == U' ');
+
+
+ (void)(u8' ' == u' ');
+ (void)(u8' ' == u' ');
+
+
+ (void)(u8 == U' '); // expected-warning{{comparing values of different Unicode code unit types 'char8_t' and 'char32_t' may compare different codepoints}}
+ (void)(u16 == u8' '); // expected-warning{{comparing values of different Unicode code unit types 'char16_t' and 'char8_t' may compare different codepoints}}
+ (void)(u16 == u' ');
+ (void)(u16 == U' '); // expected-warning{{comparing values of different Unicode code unit types 'char16_t' and 'char32_t' may compare different codepoints}}
+
+ (void)(u32 == u8' '); // expected-warning{{comparing values of different Unicode code unit types 'char32_t' and 'char8_t' may compare different codepoints}}
+ (void)(u32 == u' '); // expected-warning{{comparing values of different Unicode code unit types 'char32_t' and 'char16_t' may compare different codepoints}}
+ (void)(u32 == U' ');
+
+
+ (void)(char8_t(0x7f) == char8_t(0x7f));
+ (void)(char8_t(0x7f) == char16_t(0x7f));
+ (void)(char8_t(0x7f) == char32_t(0x7f));
+
+ (void)(char8_t(0x80) == char8_t(0x80));
+ (void)(char8_t(0x80) == char16_t(0x80)); // expected-warning{{comparing values of different Unicode code unit types 'char8_t' and 'char16_t' compares unrelated code units '<0x...
[truncated]
|
@llvm/pr-subscribers-llvm-support Author: cor3ntin (cor3ntin) ChangescharN_t represent code units of different UTF encodings. Therefore the values of 2 different charN_t objects do not represent the same characters. In order to avoid comparing apples and oranges, we add new warnings to warn on:
We only produce the warning if we cannot establish the comparison would be safe through constant evaluation. The new Note that this PR intentionally doesn;t touches char/wchar_t, but it would be worth considering also warning on extending the new warnings to these types (in a follow up) Additionally most arithmetic operations on charN_t don't really make sense (ie what does it mean to addition code units), so we could add warnings for that. Fixes #138526 Patch is 24.76 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/138708.diff 12 Files Affected:
diff --git a/clang/docs/ReleaseNotes.rst b/clang/docs/ReleaseNotes.rst
index 203958dab7430..3a42f43d79fd1 100644
--- a/clang/docs/ReleaseNotes.rst
+++ b/clang/docs/ReleaseNotes.rst
@@ -503,6 +503,10 @@ Improvements to Clang's diagnostics
- ``-Wreserved-identifier`` now fires on reserved parameter names in a function
declaration which is not a definition.
+- A new ``-Wimplicit-unicode-conversion`` warns where comparing or implicitly converting
+ between different Unicode character types (``char8_t``, ``char16_t``, ``char32_t``).
+ This warning only triggers in C++ as these types are aliases in C. (#GH138526)
+
Improvements to Clang's time-trace
----------------------------------
diff --git a/clang/include/clang/AST/ASTDiagnostic.h b/clang/include/clang/AST/ASTDiagnostic.h
index ef22249828629..baa410e3e4a03 100644
--- a/clang/include/clang/AST/ASTDiagnostic.h
+++ b/clang/include/clang/AST/ASTDiagnostic.h
@@ -38,6 +38,9 @@ namespace clang {
/// is initialized before passing it in.
QualType desugarForDiagnostic(ASTContext &Context, QualType QT,
bool &ShouldAKA);
+
+ std::string FormatUTFCodeUnitAsCodepoint(unsigned Value, QualType T);
+
} // end namespace clang
#endif
diff --git a/clang/include/clang/AST/Type.h b/clang/include/clang/AST/Type.h
index 02a6fb5333538..7fca11fb708cf 100644
--- a/clang/include/clang/AST/Type.h
+++ b/clang/include/clang/AST/Type.h
@@ -2524,6 +2524,7 @@ class alignas(TypeAlignment) Type : public ExtQualsTypeCommonBase {
bool isChar16Type() const;
bool isChar32Type() const;
bool isAnyCharacterType() const;
+ bool isUnicodeCharacterType() const;
bool isIntegralType(const ASTContext &Ctx) const;
/// Determine whether this type is an integral or enumeration type.
diff --git a/clang/include/clang/Basic/DiagnosticGroups.td b/clang/include/clang/Basic/DiagnosticGroups.td
index 1faf8508121f4..e5b5dbbd07f10 100644
--- a/clang/include/clang/Basic/DiagnosticGroups.td
+++ b/clang/include/clang/Basic/DiagnosticGroups.td
@@ -111,6 +111,7 @@ def EnumConversion : DiagGroup<"enum-conversion",
ImplicitEnumEnumCast,
EnumFloatConversion,
EnumCompareConditional]>;
+def ImplicitUnicodeConversion : DiagGroup<"implicit-unicode-conversion">;
def DeprecatedOFast : DiagGroup<"deprecated-ofast">;
def ObjCSignedCharBoolImplicitIntConversion :
DiagGroup<"objc-signed-char-bool-implicit-int-conversion">;
diff --git a/clang/include/clang/Basic/DiagnosticSemaKinds.td b/clang/include/clang/Basic/DiagnosticSemaKinds.td
index e5a7cdc14a737..a018f6693cff2 100644
--- a/clang/include/clang/Basic/DiagnosticSemaKinds.td
+++ b/clang/include/clang/Basic/DiagnosticSemaKinds.td
@@ -4357,6 +4357,26 @@ def warn_address_of_reference_bool_conversion : Warning<
"code; pointer may be assumed to always convert to true">,
InGroup<UndefinedBoolConversion>;
+def warn_impcast_unicode_char_type : Warning<
+ "implicit conversion from %0 to %1 may change the meaning of the represented code unit">,
+ InGroup<ImplicitUnicodeConversion>;
+def warn_impcast_unicode_precision : Warning<
+ "implicit conversion from %0 to %1 may lose precision and change the meaning of the represented code unit">,
+ InGroup<ImplicitUnicodeConversion>;
+def warn_impcast_unicode_char_type_constant
+ : Warning<"implicit conversion from %0 to %1 changes the meaning of the "
+ "%select{code unit|codepoint}2 '%3'">,
+ InGroup<ImplicitUnicodeConversion>;
+
+def warn_comparison_unicode_mixed_types : Warning<
+ "comparing values of different Unicode code unit types %0 and %1 may compare different codepoints">,
+ InGroup<ImplicitUnicodeConversion>;
+
+def warn_comparison_unicode_mixed_types_constant
+ : Warning<"comparing values of different Unicode code unit types %0 and %1 "
+ "compares unrelated code units '%2' and '%3'">,
+ InGroup<ImplicitUnicodeConversion>;
+
def warn_xor_used_as_pow : Warning<
"result of '%0' is %1; did you mean exponentiation?">,
InGroup<XorUsedAsPow>;
@@ -7719,6 +7739,11 @@ def warn_comparison_of_mixed_enum_types_switch : Warning<
"%diff{ ($ and $)|}0,1">,
InGroup<EnumCompareSwitch>;
+def warn_arith_conv_mixed__unicode_types
+ : Warning<"%sub{select_arith_conv_kind}0 "
+ "different Unicode character types %1 and %2">,
+ InGroup<ImplicitUnicodeConversion>;
+
def err_typecheck_assign_const : Error<
"%select{"
"cannot assign to return value because function %1 returns a const value|"
diff --git a/clang/lib/AST/ASTDiagnostic.cpp b/clang/lib/AST/ASTDiagnostic.cpp
index 6cb09b0492ac9..0c9f50fb1a01c 100644
--- a/clang/lib/AST/ASTDiagnostic.cpp
+++ b/clang/lib/AST/ASTDiagnostic.cpp
@@ -20,6 +20,8 @@
#include "clang/AST/TemplateBase.h"
#include "clang/AST/Type.h"
#include "llvm/ADT/StringExtras.h"
+#include "llvm/Support/ConvertUTF.h"
+#include "llvm/Support/Format.h"
#include "llvm/Support/raw_ostream.h"
using namespace clang;
@@ -2190,3 +2192,30 @@ static bool FormatTemplateTypeDiff(ASTContext &Context, QualType FromType,
TD.DiffTemplate();
return TD.Emit();
}
+
+std::string clang::FormatUTFCodeUnitAsCodepoint(unsigned Value, QualType T) {
+ auto IsSingleCodeUnitCP = [](unsigned Value, QualType T) {
+ if (T->isChar8Type()) {
+ assert(Value <= 0xFF && "not a valid UTF-8 code unit");
+ return Value <= 0x7F;
+ }
+ if (T->isChar16Type()) {
+ assert(Value <= 0xFFFF && "not a valid UTF-16 code unit");
+ return llvm::IsSingleCodeUnitUTF16Codepoint(Value);
+ }
+ return llvm::IsSingleCodeUnitUTF32Codepoint(Value);
+ };
+ llvm::SmallVector<char, 4> Str;
+ if (!IsSingleCodeUnitCP(Value, T)) {
+ llvm::raw_svector_ostream OS(Str);
+ OS << "<" << llvm::format_hex(Value, 1, /*Upper=*/true) << ">";
+ return std::string(Str.begin(), Str.end());
+ }
+
+ char Buffer[UNI_MAX_UTF8_BYTES_PER_CODE_POINT];
+ char *Ptr = Buffer;
+ [[maybe_unused]] bool Converted = llvm::ConvertCodePointToUTF8(Value, Ptr);
+ assert(Converted && "trying to encode invalid code unit");
+ EscapeStringForDiagnostic(StringRef(Buffer, Ptr - Buffer), Str);
+ return std::string(Str.begin(), Str.end());
+}
diff --git a/clang/lib/AST/Type.cpp b/clang/lib/AST/Type.cpp
index fbd09141bc541..2da63b13faf9d 100644
--- a/clang/lib/AST/Type.cpp
+++ b/clang/lib/AST/Type.cpp
@@ -2193,6 +2193,20 @@ bool Type::isAnyCharacterType() const {
}
}
+bool Type::isUnicodeCharacterType() const {
+ const auto *BT = dyn_cast<BuiltinType>(CanonicalType);
+ if (!BT)
+ return false;
+ switch (BT->getKind()) {
+ default:
+ return false;
+ case BuiltinType::Char8:
+ case BuiltinType::Char16:
+ case BuiltinType::Char32:
+ return true;
+ }
+}
+
/// isSignedIntegerType - Return true if this is an integer type that is
/// signed, according to C99 6.2.5p4 [char, signed char, short, int, long..],
/// an enum decl which has a signed representation
diff --git a/clang/lib/Sema/SemaChecking.cpp b/clang/lib/Sema/SemaChecking.cpp
index 97f623f61a405..d12b5cea37aa6 100644
--- a/clang/lib/Sema/SemaChecking.cpp
+++ b/clang/lib/Sema/SemaChecking.cpp
@@ -14,6 +14,7 @@
#include "CheckExprLifetime.h"
#include "clang/AST/APValue.h"
#include "clang/AST/ASTContext.h"
+#include "clang/AST/ASTDiagnostic.h"
#include "clang/AST/Attr.h"
#include "clang/AST/AttrIterator.h"
#include "clang/AST/CharUnits.h"
@@ -11810,6 +11811,46 @@ static void DiagnoseIntInBoolContext(Sema &S, Expr *E) {
}
}
+static void DiagnoseMixedUnicodeImplicitConversion(Sema &S, const Type *Source,
+ const Type *Target, Expr *E,
+ QualType T,
+ SourceLocation CC) {
+ assert(Source->isUnicodeCharacterType() && Target->isUnicodeCharacterType() &&
+ Source != Target);
+ Expr::EvalResult Result;
+ if (E->EvaluateAsInt(Result, S.getASTContext(), Expr::SE_AllowSideEffects,
+ S.isConstantEvaluatedContext())) {
+ llvm::APSInt Value(32);
+ Value = Result.Val.getInt();
+ bool IsASCII = Value <= 0x7F;
+ bool IsBMP = Value <= 0xD7FF || (Value >= 0xE000 && Value <= 0xFFFF);
+ bool ConversionPreservesSemantics =
+ IsASCII || (!Source->isChar8Type() && !Target->isChar8Type() && IsBMP);
+
+ if (!ConversionPreservesSemantics) {
+ auto IsSingleCodeUnitCP = [](const QualType &T,
+ const llvm::APSInt &Value) {
+ if (T->isChar8Type())
+ return llvm::IsSingleCodeUnitUTF8Codepoint(Value.getExtValue());
+ if (T->isChar16Type())
+ return llvm::IsSingleCodeUnitUTF16Codepoint(Value.getExtValue());
+ return llvm::IsSingleCodeUnitUTF32Codepoint(Value.getExtValue());
+ };
+
+ S.Diag(CC, diag::warn_impcast_unicode_char_type_constant)
+ << E->getType() << T
+ << IsSingleCodeUnitCP(E->getType().getUnqualifiedType(), Value)
+ << FormatUTFCodeUnitAsCodepoint(Value.getExtValue(), E->getType());
+ }
+ } else {
+ bool LosesPrecision = S.getASTContext().getIntWidth(E->getType()) >
+ S.getASTContext().getIntWidth(T);
+ DiagnoseImpCast(S, E, T, CC,
+ LosesPrecision ? diag::warn_impcast_unicode_precision
+ : diag::warn_impcast_unicode_char_type);
+ }
+}
+
void Sema::CheckImplicitConversion(Expr *E, QualType T, SourceLocation CC,
bool *ICContext, bool IsListInit) {
if (E->isTypeDependent() || E->isValueDependent()) return;
@@ -12147,6 +12188,13 @@ void Sema::CheckImplicitConversion(Expr *E, QualType T, SourceLocation CC,
DiscardMisalignedMemberAddress(Target, E);
+
+ if(Source->isUnicodeCharacterType() && Target->isUnicodeCharacterType()) {
+ DiagnoseMixedUnicodeImplicitConversion(*this, Source, Target, E, T, CC);
+ return;
+ }
+
+
if (Target->isBooleanType())
DiagnoseIntInBoolContext(*this, E);
diff --git a/clang/lib/Sema/SemaExpr.cpp b/clang/lib/Sema/SemaExpr.cpp
index be3f145f3c5f1..b0080b778db61 100644
--- a/clang/lib/Sema/SemaExpr.cpp
+++ b/clang/lib/Sema/SemaExpr.cpp
@@ -15,6 +15,7 @@
#include "UsedDeclVisitor.h"
#include "clang/AST/ASTConsumer.h"
#include "clang/AST/ASTContext.h"
+#include "clang/AST/ASTDiagnostic.h"
#include "clang/AST/ASTLambda.h"
#include "clang/AST/ASTMutationListener.h"
#include "clang/AST/CXXInheritance.h"
@@ -1567,6 +1568,72 @@ void Sema::checkEnumArithmeticConversions(Expr *LHS, Expr *RHS,
}
}
+static void CheckUnicodeArithmeticConversions(Sema & SemaRef,
+ Expr *LHS,
+ Expr *RHS,
+ SourceLocation Loc,
+ ArithConvKind ACK) {
+ QualType LHSType = LHS->getType().getUnqualifiedType();
+ QualType RHSType = RHS->getType().getUnqualifiedType();
+
+ if(!SemaRef.getLangOpts().CPlusPlus ||
+ !LHSType->isUnicodeCharacterType() || !RHSType->isUnicodeCharacterType())
+ return;
+
+ if(ACK == ArithConvKind::Comparison) {
+ if (SemaRef.getASTContext().hasSameType(LHSType, RHSType))
+ return;
+
+ Expr::EvalResult LHSRes, RHSRes;
+ bool Success = LHS->EvaluateAsInt(LHSRes, SemaRef.getASTContext(),
+ Expr::SE_AllowSideEffects,
+ SemaRef.isConstantEvaluatedContext());
+ if (Success)
+ Success = RHS->EvaluateAsInt(RHSRes, SemaRef.getASTContext(),
+ Expr::SE_AllowSideEffects,
+ SemaRef.isConstantEvaluatedContext());
+ if (Success) {
+ llvm::APSInt LHSValue(32);
+ LHSValue = LHSRes.Val.getInt();
+ llvm::APSInt RHSValue(32);
+ RHSValue = RHSRes.Val.getInt();
+
+ auto IsSingleCodeUnitCP = [](const QualType &T,
+ const llvm::APSInt &Value) {
+ if (T->isChar8Type())
+ return llvm::IsSingleCodeUnitUTF8Codepoint(Value.getExtValue());
+ if (T->isChar16Type())
+ return llvm::IsSingleCodeUnitUTF16Codepoint(Value.getExtValue());
+ return llvm::IsSingleCodeUnitUTF32Codepoint(Value.getExtValue());
+ };
+
+ bool LHSSafe = IsSingleCodeUnitCP(LHSType, LHSValue);
+ bool RHSSafe = IsSingleCodeUnitCP(RHSType, RHSValue);
+ if (LHSSafe && RHSSafe)
+ return;
+
+ SemaRef.Diag(Loc, diag::warn_comparison_unicode_mixed_types_constant)
+ << LHS->getSourceRange() << RHS->getSourceRange() << LHSType
+ << RHSType
+ << FormatUTFCodeUnitAsCodepoint(LHSValue.getExtValue(), LHSType)
+ << FormatUTFCodeUnitAsCodepoint(RHSValue.getExtValue(), RHSType);
+ return;
+ }
+ SemaRef.Diag(Loc, diag::warn_comparison_unicode_mixed_types)
+ << LHS->getSourceRange() << RHS->getSourceRange()
+ << LHSType << RHSType;
+ return;
+ }
+
+ if (SemaRef.getASTContext().hasSameType(LHSType, RHSType))
+ return;
+
+ SemaRef.Diag(Loc, diag::warn_arith_conv_mixed__unicode_types)
+ << LHS->getSourceRange() << RHS->getSourceRange() << ACK << LHSType
+ << RHSType;
+ return;
+}
+
/// UsualArithmeticConversions - Performs various conversions that are common to
/// binary operators (C99 6.3.1.8). If both operands aren't arithmetic, this
/// routine returns the first non-arithmetic type found. The client is
@@ -1574,8 +1641,12 @@ void Sema::checkEnumArithmeticConversions(Expr *LHS, Expr *RHS,
QualType Sema::UsualArithmeticConversions(ExprResult &LHS, ExprResult &RHS,
SourceLocation Loc,
ArithConvKind ACK) {
+
checkEnumArithmeticConversions(LHS.get(), RHS.get(), Loc, ACK);
+ CheckUnicodeArithmeticConversions(*this, LHS.get(), RHS.get(),
+ Loc, ACK);
+
if (ACK != ArithConvKind::CompAssign) {
LHS = UsualUnaryConversions(LHS.get());
if (LHS.isInvalid())
diff --git a/clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp b/clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp
new file mode 100644
index 0000000000000..41794b15175b5
--- /dev/null
+++ b/clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp
@@ -0,0 +1,155 @@
+// RUN: %clang_cc1 -verify -fsyntax-only -std=c++20 -Wconversion %s
+
+void c8(char8_t);
+void c16(char16_t);
+void c32(char32_t);
+
+void test(char8_t u8, char16_t u16, char32_t u32) {
+ c8(u8);
+ c8(u16); // expected-warning {{implicit conversion from 'char16_t' to 'char8_t' may lose precision and change the meaning of the represented code unit}}
+ c8(u32); // expected-warning {{implicit conversion from 'char32_t' to 'char8_t' may lose precision and change the meaning of the represented code unit}}
+
+ c16(u8); // expected-warning {{implicit conversion from 'char8_t' to 'char16_t' may change the meaning of the represented code unit}}
+ c16(u16);
+ c16(u32); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' may lose precision and change the meaning of the represented code unit}}
+
+ c32(u8); // expected-warning {{implicit conversion from 'char8_t' to 'char32_t' may change the meaning of the represented code unit}}
+ c32(u16); // expected-warning {{implicit conversion from 'char16_t' to 'char32_t' may change the meaning of the represented code unit}}
+ c32(u32);
+
+
+ c8(char32_t(0x7f));
+ c8(char32_t(0x80)); // expected-warning {{implicit conversion from 'char32_t' to 'char8_t' changes the meaning of the codepoint '<U+0080>'}}
+
+ c8(char16_t(0x7f));
+ c8(char16_t(0x80)); // expected-warning {{implicit conversion from 'char16_t' to 'char8_t' changes the meaning of the codepoint '<U+0080>'}}
+ c8(char16_t(0xD800)); // expected-warning {{implicit conversion from 'char16_t' to 'char8_t' changes the meaning of the code unit '<0xD800>'}}
+ c8(char16_t(0xE000)); // expected-warning {{implicit conversion from 'char16_t' to 'char8_t' changes the meaning of the codepoint '<U+E000>'}}
+
+
+ c16(char32_t(0x7f));
+ c16(char32_t(0x80));
+ c16(char32_t(0xD7FF));
+ c16(char32_t(0xD800)); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' changes the meaning of the code unit '<0xD800>'}}
+ c16(char32_t(0xE000));
+ c16(char32_t(U'🐉')); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' changes the meaning of the codepoint '🐉'}}
+
+
+ c32(char8_t(0x7f));
+ c32(char8_t(0x80)); // expected-warning {{implicit conversion from 'char8_t' to 'char32_t' changes the meaning of the code unit '<0x80>'}}
+ c32(char8_t(0xFF)); // expected-warning {{implicit conversion from 'char8_t' to 'char32_t' changes the meaning of the code unit '<0xFF>'}}
+
+
+ c32(char16_t(0x7f));
+ c32(char16_t(0x80));
+
+ c32(char16_t(0xD7FF));
+ c32(char16_t(0xD800)); // expected-warning {{implicit conversion from 'char16_t' to 'char32_t' changes the meaning of the code unit '<0xD800>'}}
+ c32(char16_t(0xDFFF)); // expected-warning {{implicit conversion from 'char16_t' to 'char32_t' changes the meaning of the code unit '<0xDFFF>'}}
+ c32(char16_t(0xE000));
+ c32(char16_t(u'☕'));
+
+ (void)static_cast<char32_t>(char8_t(0x80)); // sanity check: no explicit conversion;
+
+ using Char8 = char8_t;
+ Char8 c81 = u16; // expected-warning {{implicit conversion from 'char16_t' to 'Char8' (aka 'char8_t') may lose precision and change the meaning of the represented code unit}}
+
+ [[maybe_unused]] char c = u16; // expected-warning {{implicit conversion loses integer precision: 'char16_t' to 'char'}}
+
+ // FIXME: We should apply the same logic to wchar
+ [[maybe_unused]] wchar_t wc = u16;
+ [[maybe_unused]] wchar_t wc2 = u8;
+}
+
+void test_comp(char8_t u8, char16_t u16, char32_t u32) {
+ (void)(u8 == u8' ');
+ (void)(u8 == u' '); // expected-warning{{comparing values of different Unicode code unit types 'char8_t' and 'char16_t' may compare different codepoints}}
+ (void)(u8 == U' '); // expected-warning{{comparing values of different Unicode code unit types 'char8_t' and 'char32_t' may compare different codepoints}}
+
+ (void)(u16 == u8' '); // expected-warning{{comparing values of different Unicode code unit types 'char16_t' and 'char8_t' may compare different codepoints}}
+ (void)(u16 == u' ');
+ (void)(u16 == U' '); // expected-warning{{comparing values of different Unicode code unit types 'char16_t' and 'char32_t' may compare different codepoints}}
+
+ (void)(u32 == u8' '); // expected-warning{{comparing values of different Unicode code unit types 'char32_t' and 'char8_t' may compare different codepoints}}
+ (void)(u32 == u' '); // expected-warning{{comparing values of different Unicode code unit types 'char32_t' and 'char16_t' may compare different codepoints}}
+ (void)(u32 == U' ');
+
+
+ (void)(u8' ' == u' ');
+ (void)(u8' ' == u' ');
+
+
+ (void)(u8 == U' '); // expected-warning{{comparing values of different Unicode code unit types 'char8_t' and 'char32_t' may compare different codepoints}}
+ (void)(u16 == u8' '); // expected-warning{{comparing values of different Unicode code unit types 'char16_t' and 'char8_t' may compare different codepoints}}
+ (void)(u16 == u' ');
+ (void)(u16 == U' '); // expected-warning{{comparing values of different Unicode code unit types 'char16_t' and 'char32_t' may compare different codepoints}}
+
+ (void)(u32 == u8' '); // expected-warning{{comparing values of different Unicode code unit types 'char32_t' and 'char8_t' may compare different codepoints}}
+ (void)(u32 == u' '); // expected-warning{{comparing values of different Unicode code unit types 'char32_t' and 'char16_t' may compare different codepoints}}
+ (void)(u32 == U' ');
+
+
+ (void)(char8_t(0x7f) == char8_t(0x7f));
+ (void)(char8_t(0x7f) == char16_t(0x7f));
+ (void)(char8_t(0x7f) == char32_t(0x7f));
+
+ (void)(char8_t(0x80) == char8_t(0x80));
+ (void)(char8_t(0x80) == char16_t(0x80)); // expected-warning{{comparing values of different Unicode code unit types 'char8_t' and 'char16_t' compares unrelated code units '<0x...
[truncated]
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
@@ -19,7 +19,9 @@ | |||
// equal(Iter1 first1, Iter1 last1, Iter2 first2, Iter2 last2); | |||
|
|||
// We test the cartesian product, so we sometimes compare differently signed types | |||
// ADDITIONAL_COMPILE_FLAGS(gcc-style-warnings): -Wno-sign-compare -Wno-implicit-unicode-conversion | |||
// ADDITIONAL_COMPILE_FLAGS(gcc-style-warnings): -Wno-sign-compare | |||
// ADDITIONAL_COMPILE_FLAGS(clang-21): -Wno-implicit-unicode-conversion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens in clang-22?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this, thanks for working on it! I added a number of mostly minor comments. I would like to see the warnings extended to char
and wchar_t
and doing that in a later change is fine. Some of my comments are intended to de-emphasize Unicode so that the warning category and diangostics remain applicable for non-Unicode encodings (even if we don't add diagnostics for such any time soon).
c16(char32_t(0xD7FF)); | ||
c16(char32_t(0xD800)); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' changes the meaning of the code unit '<0xD800>'}} | ||
c16(char32_t(0xE000)); | ||
c16(char32_t(U'🐉')); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' changes the meaning of the codepoint '🐉'}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is odd to use the term code point (an integer value) and then display a character (🐉). Perhaps we could say something like this:
c16(char32_t(U'🐉')); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' changes the meaning of the codepoint '🐉'}} | |
c16(char32_t(U'🐉')); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' loses the representation of the character '🐉'}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except that the code point may either not be printable, or not be a character.
And the represention stays the same!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. In my mind, the term "code point" always designates an integer type or value. "Character" is more fluid; it could mean a glyph or the character designated by a code point. But it doesn't work for an invalid code point, so, hmm, ok. This isn't a new concern introduced with this change anyway. Let's just move on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just thinking about this some more. The presentation will display an invalid code unit or code point value as '<0xXXXX>'
, an unprintable character as '<U+XXXX>'
, and a printable character as, e.g., '🐉'
. Perhaps those distinctions suffice to drop "the code point"? For example:
c16(char32_t(U'🐉')); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' changes the meaning of '🐉'}}
We could also present both the source value and the post conversion result to make it more explicit how the meaning changes. For example:
c16(char32_t(U'🐉')); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' changes the meaning of '🐉' to '<U+F409>'}}
Or, if char16_t
happens to be larger than 16 bits (I'm not sure we should care about this case though):
c16(char32_t(U'🐉')); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' changes the meaning of '🐉' to '<0x1F409>'}}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a distinction between a code unit and a code point in the diagnostic message already. Which I think is useful information. But the distinction between printable and not can just be the presentation,
We could also present both the source value and the post conversion result to make it more explicit how the meaning changes. For example:
Presenting the source is something I tried, but because one side is always going to be invalid (ie not a complete code point) it's not actually useful
Or, if char16_t happens to be larger than 16 bits (I'm not sure we should care about this case though):
This is not a scenario in Clang
…oint representable in both types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cor3ntin, this looks good to me!
c16(char32_t(0xD7FF)); | ||
c16(char32_t(0xD800)); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' changes the meaning of the code unit '<0xD800>'}} | ||
c16(char32_t(0xE000)); | ||
c16(char32_t(U'🐉')); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' changes the meaning of the codepoint '🐉'}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. In my mind, the term "code point" always designates an integer type or value. "Character" is more fluid; it could mean a glyph or the character designated by a code point. But it doesn't work for an invalid code point, so, hmm, ok. This isn't a new concern introduced with this change anyway. Let's just move on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mean to be difficult, but I'm always weary about removing test coverage, and I'm not yet convinced that the tests didn't intend to test these sketchy conversions.
The way it uses the typedefs elsewhere in the same tests, while pinning some of the types initially to char16_t
, for example, suggests to me it was intentional (though not how you would want users to do it).
I'm happy to proceed without addressing my concerns if another expert can vouch for the test changes, otherwise I'm sure I'll be convinced by the answers to the inline comments.
Apologies for my paranoia.
@@ -123,7 +123,7 @@ _LIBCPP_HIDE_FROM_ABI constexpr void __encode(_OutIt& __out_it, char32_t __value | |||
_LIBCPP_ASSERT_UNCATEGORIZED(__is_scalar_value(__value), "an invalid unicode scalar value results in invalid UTF-16"); | |||
|
|||
if (__value < 0x10000) { | |||
*__out_it++ = __value; | |||
*__out_it++ = static_cast<iter_value_t<_OutIt>>(__value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, this static cast will never change the value when the assertion above is enabled, right? Because we catch the potential narrowing there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup!
@@ -618,7 +618,7 @@ template <class InternT, class ExternT> | |||
void utf8_to_utf16_in_error(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) { | |||
// UTF-8 string of 1-byte CP, 2-byte CP, 3-byte CP, 4-byte CP | |||
const unsigned char input[] = "b\u0448\uD700\U0010AAAA"; | |||
const char16_t expected[] = {'b', 0x0448, 0xD700, 0xDBEA, 0xDEAA, 0}; | |||
const InternT expected[] = {0x62, 0x0448, 0xD700, 0xDBEA, 0xDEAA, 0}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that these tests are using the API as it was intended (before it was removed for being bad)?
As in, they are using something like std::codecvt_utf8_to_utf16
with the char16_t type in their code.
And this set of changes removes test coverage for that case, because it generates the warning about potential narrowing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the point of having the literal b
in the expected output is to match the b
in the input,
which gets totally lost with this change.
Would static_cast<InternT>('b')
work instead?
On second thought, why is the change from b
to 0x62 needed ? That will never narrow or change meaning, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes ('b) were a no-op, I reverted them
@@ -34,6 +34,6 @@ int main(int, char**) { | |||
assert(from_next - from == 9); | |||
assert(to_next - to == 9); | |||
for (unsigned i = 0; i < 9; ++i) | |||
assert(to[i] == from[i]); | |||
assert(static_cast<char32_t>(to[i]) == static_cast<char32_t>(from[i])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we casting both types here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the redundant cast
@@ -484,7 +484,7 @@ template <class InternT, class ExternT> | |||
void utf8_to_utf16_in_ok(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) { | |||
// UTF-8 string of 1-byte CP, 2-byte CP, 3-byte CP and 4-byte CP | |||
const unsigned char input[] = "b\u0448\uAAAA\U0010AAAA"; | |||
const char16_t expected[] = {'b', 0x0448, 0xAAAA, 0xDBEA, 0xDEAA, 0}; | |||
const InternT expected[] = {0x62, 0x0448, 0xAAAA, 0xDBEA, 0xDEAA, 0}; | |||
static_assert(array_size(input) == 11, ""); | |||
static_assert(array_size(expected) == 6, ""); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Github won't let you leave comments on lines outside of the change context,
but I don't understand why the test declares two copies of the expected data, the const one above, and the one it uses below, if it wasn't the intention to test this sort of type mismatch.
It seems to me the test is saying "I expect to be able to copy between these two types"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, I don/t know either. but it does not test codecvt
- my guess is to be able to use a string literal and let the compiler encode to utf8? (that would explain in
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially these test were written in c++11 and the variable expected
was Unicode string literal. The variable exp
is there to get the type i want. I use the same trick for the input data, see variables input
and in
. But later I ported this test to be c++98 because appreantly libc++ supports these codecvts in that mode too, and i had to drop the usage of Unicode string literals.
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/51/builds/16212 Here is the relevant piece of the build log for the reference
|
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/66/builds/13997 Here is the relevant piece of the build log for the reference
|
Hi, this PR broke a few buildbots. Could you take a look? https://lab.llvm.org/buildbot/#/builders/66/builds/13997 |
I'm investigating the CI failure |
Silence a warning in gtest casting a char8_t/char16_t to char32_t. Note that this cast, as well as the behavior of `PrintTo(char32_t)` is incorrect when printing a code unit that does not represent a code point.
Silence a warning in gtest casting a char8_t/char16_t to char32_t. Note that this cast, as well as the behavior of `PrintTo(char32_t)` is incorrect when printing a code unit that does not represent a code point. See google/googletest#4762
@qinkunbao I committed a fix - I'll try to monitor for further issues, thanks! |
…m#138708)" breaks build of rocSOLVER [2025-05-15T22:58:37.770Z] /home/jenkins/agent/workspace/compiler-psdb-amd-staging/Libs/rocSOLVER/build/deps/gtest-1.11.0-src/googletest/include/gtest/gtest-printers.h:474:35: error: implicit conversion from 'char16_t' to 'char32_t' may change the meaning of the represented code unit [-Werror,-Wcharacter-conversion] [2025-05-15T22:58:37.770Z] 474 | PrintTo(ImplicitCast_<char32_t>(c), os); [2025-05-15T22:58:37.771Z] | ~~~~~~~~~~~~~ ^ [2025-05-15T22:58:38.975Z] 1 error generated. This reverts commit 381a649.
llvm/llvm-project#138708 added new for converting different charN_t types, which trigger in the googletest build. Disable this warning until it's fixed upstream.
llvm/llvm-project#138708 added a new warning for converting different charN_t types, which trigger in the googletest build. Disable this warning until it's fixed upstream.
llvm/llvm-project#138708 added a new warning for converting different charN_t types, which triggers in the googletest build. Disable this warning until it's fixed upstream.
charN_t represent code units of different UTF encodings. Therefore the values of 2 different charN_t objects do not represent the same characters. In order to avoid comparing apples and oranges, we add new warnings to warn on: - Implicit conversions - Comparisons - Other cases involving arithmetic conversions We only produce the warning if we cannot establish the comparison would be safe through constant evaluation. The new `-Wimplicit-unicode-conversion` warning is enabled by default. Note that this PR intentionally doesn;t touches char/wchar_t, but it would be worth considering also warning on extending the new warnings to these types (in a follow up) Additionally most arithmetic operations on charN_t don't really make sense (ie what does it mean to addition code units), so we could add warnings for that. Fixes llvm#138526
Clang has recently added "warnings when mixing different charN_t types" [1]. The rationale is that "charN_t represent code units of different UTF encodings. Therefore the values of 2 different charN_t objects do not represent the same characters." Note that the warning here may be legitimate - from #4762: "[...] This is incorrect for values that do not represent valid codepoints." For the time being, silence the warning by being more explicit about the conversion being intentional by using static_cast. Link: llvm/llvm-project#138708 [1] PiperOrigin-RevId: 760644157 Change-Id: I2e6cc1871975455cecac8731b2f93fd5beeaf0e1
llvm/llvm-project#138708 now warns when mixing charN_types. This conversion warning is suppressed for aidl's dependency on third_party/googletest until this gets resolved in upstream googletest: google/googletest#4762 Bug: 418047052 Change-Id: I0e6e6fcd6d37a018a0d7933217bd58c6f3b45d39 Reviewed-on: https://fuchsia-review.googlesource.com/c/fuchsia/+/1279146 Commit-Queue: Auto-Submit <auto-submit-builder@fuchsia-internal-service-accts.iam.gserviceaccount.com> Fuchsia-Auto-Submit: Caslyn Tonelli <caslyn@google.com> Reviewed-by: Aaron Wood <aaronwood@google.com>
llvm/llvm-project#138708 now warns when mixing charN_types. This conversion warning is suppressed for aidl's dependency on third_party/googletest until this gets resolved in upstream googletest: google/googletest#4762 Original-Bug: 418047052 Original-Reviewed-on: https://fuchsia-review.googlesource.com/c/fuchsia/+/1279146 Original-Revision: 110dec66bce4cfbc601bc08b711c81060e0b5dbb GitOrigin-RevId: 4c7ddddb3513b22889070f8be8f4d90c43a94211 Change-Id: I2e6af1cb9932f8450c04d5b3eab792d6d6cc836d
charN_t represent code units of different UTF encodings. Therefore the values of 2 different charN_t objects do not represent the same characters.
In order to avoid comparing apples and oranges, we add new warnings to warn on:
We only produce the warning if we cannot establish the comparison would be safe through constant evaluation.
The new
-Wimplicit-unicode-conversion
warning is enabled by default.Note that this PR intentionally doesn;t touches char/wchar_t, but it would be worth considering also warning on extending the new warnings to these types (in a follow up)
Additionally most arithmetic operations on charN_t don't really make sense (ie what does it mean to addition code units), so we could add warnings for that.
Fixes #138526